Predicting NBA Game Outcomes

By: Drew Hibbard

I will attempt to build a machine learning model that predicts the winner of an NBA game as accurately as possible. I will treat the problem first as binary classification (win/loss) and then as regression, predicting the net score and the total score. I will run numerous classification models with varying parameters to see which generates the best predictions, along with several preprocessing and regularization techniques. These include:

  • Logistic Regression
  • Random Forest Classifier
  • K Nearest Neighbors
  • Polynomial Features
  • Standard Scaler
  • L2 Regularization (Ridge)
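Several items in that list (Standard Scaler, Polynomial Features, L2 regularization) are preprocessing and regularization techniques rather than standalone classifiers; in scikit-learn they chain together into a single estimator. A minimal sketch on synthetic data, purely to illustrate how the pieces combine (this is not the project's actual pipeline):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for a real feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary target

# scale -> expand interactions -> L2-regularized logistic regression
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, interaction_only=True)),
    ('clf', LogisticRegression(max_iter=10000)),  # L2 penalty is sklearn's default
])
pipe.fit(X, y)
print(round(pipe.score(X, y), 2))
```

Cross-validating the pipeline object itself (rather than pre-scaled arrays) also keeps the scaler's statistics from leaking out of the training folds.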

Gather the Data

I have gathered data from a few different sources to compile the following for games going back to the 2003 season:

  • Game date and season via kaggle
  • Home team and visiting team via kaggle
  • Outcome of each game via kaggle
  • Per-season statistics for each team via stats.nba.com:

    • win percentage
    • 3 pointers made
    • 3 pointers attempted
    • 3 point percentage
    • free throws made
    • free throws attempted
    • free throw percentage
  • Per-season advanced statistics for each team via stats.nba.com:

    • Offensive rating
    • Defensive rating
    • Net rating
    • Assist percentage
    • Assist/turnover ratio
    • Assist ratio
    • Offensive rebounding percentage
    • Defensive rebounding percentage
    • Total rebounding percentage
    • Turnover percentage
    • Effective field goal percentage
  • More advanced statistics for each team via fivethirtyeight:

    • RAPTOR (fivethirtyeight's all-in player-value metric; see their RAPTOR explainer for the details)
    • Wins above replacement

Data Wrangling Done in Excel

  • Calculated days of rest for each team before each game; season openers were set to 0
  • Pulled in all the team stats with a lookup on (TeamID + Season)
  • Pulled in RAPTOR data, which is provided per player per season. To get a team value per season, I multiplied every player's RAPTOR score by their minutes played that season, then summed per team-season.
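The minutes-weighted RAPTOR roll-up could equally be done in pandas; the player rows and column names below are made up for illustration, not the actual fivethirtyeight schema:

```python
import pandas as pd

# hypothetical per-player, per-season RAPTOR data
players = pd.DataFrame({
    'team':    ['GSW', 'GSW', 'BOS', 'BOS'],
    'season':  [2017, 2017, 2017, 2017],
    'raptor':  [7.2, 4.1, 5.5, -1.0],
    'minutes': [2800, 2500, 2600, 1900],
})

# weight each player's RAPTOR by minutes played, then sum per team-season
players['weighted'] = players['raptor'] * players['minutes']
team_raptor = players.groupby(['team', 'season'])['weighted'].sum().rename('team_raptor')
print(team_raptor)
```

Each team-season total can then be joined onto the games table via the same (TeamID + Season) lookup used for the other stats.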
In [1]:
# imports to get started
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
# load the data and check out the head

games = pd.read_csv('games.csv')
games.head()
Out[2]:
game_date_est game_id home_team_id visitor_team_id home_team_name visitor_team_name season home_days_rest away_days_rest home_win_pct ... away_ast_ratio away_oreb_pct away_dreb_pct away_reb_pct away_tov_pct away_efg_pct pts_home pts_away home_net_pts home_team_wins
0 3/1/2020 21900895 1610612766 1610612749 Charlotte Hornets Milwaukee Bucks 2019 2 2 0.354 ... 18.1 24.1 77.6 52.4 14.1 55.3 85.0 93.0 -8 0
1 3/1/2020 21900896 1610612750 1610612742 Minnesota Timberwolves Dallas Mavericks 2019 2 2 0.297 ... 17.6 27.5 73.8 51.0 12.7 54.8 91.0 111.0 -20 0
2 3/1/2020 21900897 1610612746 1610612755 Los Angeles Clippers Philadelphia 76ers 2019 2 3 0.688 ... 18.6 27.8 75.4 51.5 14.2 53.0 136.0 130.0 6 1
3 3/1/2020 21900898 1610612743 1610612761 Denver Nuggets Toronto Raptors 2019 2 2 0.661 ... 18.2 25.9 71.5 49.4 14.2 53.6 133.0 118.0 15 1
4 3/1/2020 21900899 1610612758 1610612765 Sacramento Kings Detroit Pistons 2019 2 2 0.438 ... 17.7 27.7 71.7 49.3 15.5 52.9 106.0 100.0 6 1

5 rows × 53 columns

In [3]:
games.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46250 entries, 0 to 46249
Data columns (total 53 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   game_date_est      46250 non-null  object 
 1   game_id            46250 non-null  int64  
 2   home_team_id       46250 non-null  int64  
 3   visitor_team_id    46250 non-null  int64  
 4   home_team_name     46250 non-null  object 
 5   visitor_team_name  46250 non-null  object 
 6   season             46250 non-null  int64  
 7   home_days_rest     46250 non-null  int64  
 8   away_days_rest     46250 non-null  int64  
 9   home_win_pct       46250 non-null  float64
 10  away_win_pct       46250 non-null  float64
 11  home_3pm           46250 non-null  float64
 12  home_3pa           46250 non-null  float64
 13  home_3p_pct        46250 non-null  float64
 14  home_ftm           46250 non-null  float64
 15  home_fta           46250 non-null  float64
 16  home_ft_pct        46250 non-null  float64
 17  away_3pm           46250 non-null  float64
 18  away_3pa           46250 non-null  float64
 19  away_3p_pct        46250 non-null  float64
 20  away_ftm           46250 non-null  float64
 21  away_fta           46250 non-null  float64
 22  away_ft_pct        46250 non-null  float64
 23  home_raptor        46167 non-null  float64
 24  away_raptor        46250 non-null  float64
 25  home_war           46167 non-null  float64
 26  away_war           46250 non-null  float64
 27  home_off_rtg       46250 non-null  float64
 28  home_def_rtg       46250 non-null  float64
 29  home_net_rtg       46250 non-null  float64
 30  home_ast_pct       46250 non-null  float64
 31  home_ast_to        46250 non-null  float64
 32  home_ast_ratio     46250 non-null  float64
 33  home_oreb_pct      46250 non-null  float64
 34  home_dreb_pct      46250 non-null  float64
 35  home_reb_pct       46250 non-null  float64
 36  home_tov_pct       46250 non-null  float64
 37  home_efg_pct       46250 non-null  float64
 38  away_off_rtg       46250 non-null  float64
 39  away_def_rtg       46250 non-null  float64
 40  away_net_rtg       46250 non-null  float64
 41  away_ast_pct       46250 non-null  float64
 42  away_ast_to        46250 non-null  float64
 43  away_ast_ratio     46250 non-null  float64
 44  away_oreb_pct      46250 non-null  float64
 45  away_dreb_pct      46250 non-null  float64
 46  away_reb_pct       46250 non-null  float64
 47  away_tov_pct       46250 non-null  float64
 48  away_efg_pct       46250 non-null  float64
 49  pts_home           46075 non-null  float64
 50  pts_away           46075 non-null  float64
 51  home_net_pts       46250 non-null  int64  
 52  home_team_wins     46250 non-null  int64  
dtypes: float64(42), int64(8), object(3)
memory usage: 18.7+ MB
In [4]:
games.describe()
Out[4]:
game_id home_team_id visitor_team_id season home_days_rest away_days_rest home_win_pct away_win_pct home_3pm home_3pa ... away_ast_ratio away_oreb_pct away_dreb_pct away_reb_pct away_tov_pct away_efg_pct pts_home pts_away home_net_pts home_team_wins
count 4.625000e+04 4.625000e+04 4.625000e+04 46250.000000 46250.000000 46250.000000 46250.000000 46250.000000 46250.000000 46250.000000 ... 46250.000000 46250.000000 46250.000000 46250.000000 46250.000000 46250.000000 46075.000000 46075.000000 46250.000000 46250.000000
mean 2.158012e+07 1.610613e+09 1.610613e+09 2010.866011 2.163805 2.163546 0.499423 0.499923 7.608954 21.285754 ... 16.835529 29.472566 70.514997 50.031075 15.124000 50.086709 100.669691 100.680174 -0.010443 0.499351
std 5.444878e+06 8.656656e+00 8.642509e+00 4.820667 1.114681 1.114137 0.146119 0.145750 2.456421 6.613790 ... 1.183503 2.740283 2.207034 1.351261 1.215048 2.279298 13.010531 13.007639 13.591515 0.500005
min 1.030002e+07 1.610613e+09 1.610613e+09 2003.000000 0.000000 0.000000 0.106000 0.106000 2.800000 8.200000 ... 14.100000 21.600000 65.000000 45.300000 11.900000 43.900000 33.000000 33.000000 -61.000000 0.000000
25% 2.060064e+07 1.610613e+09 1.610613e+09 2007.000000 2.000000 2.000000 0.402000 0.402000 5.900000 16.700000 ... 16.000000 27.500000 69.000000 49.200000 14.300000 48.400000 92.000000 92.000000 -9.000000 0.000000
50% 2.110027e+07 1.610613e+09 1.610613e+09 2011.000000 2.000000 2.000000 0.505000 0.505000 7.200000 19.900000 ... 16.800000 29.600000 70.500000 50.100000 15.100000 50.000000 100.000000 100.000000 0.000000 0.000000
75% 2.160014e+07 1.610613e+09 1.610613e+09 2015.000000 2.000000 2.000000 0.608000 0.608000 9.300000 25.400000 ... 17.600000 31.400000 71.900000 51.000000 16.000000 51.600000 109.000000 109.000000 9.000000 1.000000
max 4.180041e+07 1.610613e+09 1.610613e+09 2019.000000 13.000000 13.000000 0.838000 0.838000 16.100000 45.400000 ... 21.200000 37.300000 77.600000 54.100000 19.000000 56.900000 168.000000 168.000000 61.000000 1.000000

8 rows × 50 columns

In [5]:
# get a sense of which variables might be the best predictors
games.corr().home_team_wins.sort_values(ascending=False)
Out[5]:
home_team_wins     1.000000
home_net_pts       0.808187
pts_home           0.423956
home_win_pct       0.281627
home_net_rtg       0.273903
home_off_rtg       0.184565
home_raptor        0.179735
home_war           0.178115
away_def_rtg       0.169414
home_efg_pct       0.159187
home_3p_pct        0.139168
home_ast_to        0.125017
home_ast_ratio     0.120583
home_reb_pct       0.118812
away_tov_pct       0.078893
home_3pm           0.066430
home_dreb_pct      0.061905
home_ast_pct       0.061268
home_3pa           0.045058
home_ftm           0.040739
home_ft_pct        0.039340
home_days_rest     0.034058
home_fta           0.027415
visitor_team_id    0.024743
away_oreb_pct      0.005264
season             0.001915
game_id            0.000843
home_oreb_pct     -0.007638
home_team_id      -0.026256
away_fta          -0.028949
away_days_rest    -0.034470
away_ft_pct       -0.037660
away_ftm          -0.041582
away_3pa          -0.042329
away_dreb_pct     -0.058788
away_ast_pct      -0.062232
away_3pm          -0.063820
home_tov_pct      -0.080241
away_reb_pct      -0.117938
away_ast_ratio    -0.119757
away_ast_to       -0.123934
away_3p_pct       -0.139159
away_efg_pct      -0.156136
home_def_rtg      -0.168334
away_war          -0.177722
away_raptor       -0.179284
away_off_rtg      -0.182132
away_net_rtg      -0.272913
away_win_pct      -0.280616
pts_away          -0.423621
Name: home_team_wins, dtype: float64

Some of these features will be TOO good at predicting the game outcome, because they ARE the game outcome: pts_home, pts_away, and home_net_pts are only known after the game ends. We will exclude them from the classification features and save them for later as the regression targets.
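To make the leakage concrete, here is a toy demonstration (synthetic data, not the actual games frame) that a feature like home_net_pts determines the label exactly, so any model handed it will look perfect:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# synthetic net scores, kept away from zero so the classes separate cleanly
net_pts = rng.integers(10, 30, size=200) * rng.choice([-1, 1], size=200)
wins = (net_pts > 0).astype(int)  # the label IS a function of the feature

leaky = LogisticRegression().fit(net_pts.reshape(-1, 1), wins)
print(leaky.score(net_pts.reshape(-1, 1), wins))  # a perfect, and useless, score
```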

In [6]:
games['total_pts'] = games['pts_away'] + games['pts_home']
total_pts = games['total_pts']
net_pts = games['home_net_pts']
games.head(1)
Out[6]:
game_date_est game_id home_team_id visitor_team_id home_team_name visitor_team_name season home_days_rest away_days_rest home_win_pct ... away_oreb_pct away_dreb_pct away_reb_pct away_tov_pct away_efg_pct pts_home pts_away home_net_pts home_team_wins total_pts
0 3/1/2020 21900895 1610612766 1610612749 Charlotte Hornets Milwaukee Bucks 2019 2 2 0.354 ... 24.1 77.6 52.4 14.1 55.3 85.0 93.0 -8 0 178.0

1 rows × 54 columns

In [7]:
# check for nulls
games.isnull().sum()
Out[7]:
game_date_est          0
game_id                0
home_team_id           0
visitor_team_id        0
home_team_name         0
visitor_team_name      0
season                 0
home_days_rest         0
away_days_rest         0
home_win_pct           0
away_win_pct           0
home_3pm               0
home_3pa               0
home_3p_pct            0
home_ftm               0
home_fta               0
home_ft_pct            0
away_3pm               0
away_3pa               0
away_3p_pct            0
away_ftm               0
away_fta               0
away_ft_pct            0
home_raptor           83
away_raptor            0
home_war              83
away_war               0
home_off_rtg           0
home_def_rtg           0
home_net_rtg           0
home_ast_pct           0
home_ast_to            0
home_ast_ratio         0
home_oreb_pct          0
home_dreb_pct          0
home_reb_pct           0
home_tov_pct           0
home_efg_pct           0
away_off_rtg           0
away_def_rtg           0
away_net_rtg           0
away_ast_pct           0
away_ast_to            0
away_ast_ratio         0
away_oreb_pct          0
away_dreb_pct          0
away_reb_pct           0
away_tov_pct           0
away_efg_pct           0
pts_home             175
pts_away             175
home_net_pts           0
home_team_wins         0
total_pts            175
dtype: int64
In [8]:
# get rid of rows with nulls since there are so few
games.dropna(inplace=True)
games.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 45992 entries, 0 to 46074
Data columns (total 54 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   game_date_est      45992 non-null  object 
 1   game_id            45992 non-null  int64  
 2   home_team_id       45992 non-null  int64  
 3   visitor_team_id    45992 non-null  int64  
 4   home_team_name     45992 non-null  object 
 5   visitor_team_name  45992 non-null  object 
 6   season             45992 non-null  int64  
 7   home_days_rest     45992 non-null  int64  
 8   away_days_rest     45992 non-null  int64  
 9   home_win_pct       45992 non-null  float64
 10  away_win_pct       45992 non-null  float64
 11  home_3pm           45992 non-null  float64
 12  home_3pa           45992 non-null  float64
 13  home_3p_pct        45992 non-null  float64
 14  home_ftm           45992 non-null  float64
 15  home_fta           45992 non-null  float64
 16  home_ft_pct        45992 non-null  float64
 17  away_3pm           45992 non-null  float64
 18  away_3pa           45992 non-null  float64
 19  away_3p_pct        45992 non-null  float64
 20  away_ftm           45992 non-null  float64
 21  away_fta           45992 non-null  float64
 22  away_ft_pct        45992 non-null  float64
 23  home_raptor        45992 non-null  float64
 24  away_raptor        45992 non-null  float64
 25  home_war           45992 non-null  float64
 26  away_war           45992 non-null  float64
 27  home_off_rtg       45992 non-null  float64
 28  home_def_rtg       45992 non-null  float64
 29  home_net_rtg       45992 non-null  float64
 30  home_ast_pct       45992 non-null  float64
 31  home_ast_to        45992 non-null  float64
 32  home_ast_ratio     45992 non-null  float64
 33  home_oreb_pct      45992 non-null  float64
 34  home_dreb_pct      45992 non-null  float64
 35  home_reb_pct       45992 non-null  float64
 36  home_tov_pct       45992 non-null  float64
 37  home_efg_pct       45992 non-null  float64
 38  away_off_rtg       45992 non-null  float64
 39  away_def_rtg       45992 non-null  float64
 40  away_net_rtg       45992 non-null  float64
 41  away_ast_pct       45992 non-null  float64
 42  away_ast_to        45992 non-null  float64
 43  away_ast_ratio     45992 non-null  float64
 44  away_oreb_pct      45992 non-null  float64
 45  away_dreb_pct      45992 non-null  float64
 46  away_reb_pct       45992 non-null  float64
 47  away_tov_pct       45992 non-null  float64
 48  away_efg_pct       45992 non-null  float64
 49  pts_home           45992 non-null  float64
 50  pts_away           45992 non-null  float64
 51  home_net_pts       45992 non-null  int64  
 52  home_team_wins     45992 non-null  int64  
 53  total_pts          45992 non-null  float64
dtypes: float64(43), int64(8), object(3)
memory usage: 19.3+ MB
In [9]:
# set x and y values
features = ['home_days_rest',
       'away_days_rest', 'home_win_pct', 'away_win_pct', 'home_3pm',
       'home_3pa', 'home_3p_pct', 'home_ftm', 'home_fta', 'home_ft_pct',
       'away_3pm', 'away_3pa', 'away_3p_pct', 'away_ftm', 'away_fta',
       'away_ft_pct', 'home_raptor', 'away_raptor', 'home_war', 'away_war',
       'home_off_rtg', 'home_def_rtg', 'home_net_rtg', 'home_ast_pct',
       'home_ast_to', 'home_ast_ratio', 'home_oreb_pct', 'home_dreb_pct',
       'home_reb_pct', 'home_tov_pct', 'home_efg_pct', 'away_off_rtg',
       'away_def_rtg', 'away_net_rtg', 'away_ast_pct', 'away_ast_to',
       'away_ast_ratio', 'away_oreb_pct', 'away_dreb_pct', 'away_reb_pct',
       'away_tov_pct', 'away_efg_pct']

target = ['home_team_wins']

Let's check out some of the best teams based on the different categories.

In [10]:
for col in games[['home_3pm','home_3p_pct','home_off_rtg','home_raptor','home_win_pct']]:
    print(games.groupby(['home_team_name','season']).max()[col].sort_values(ascending=False).head(5))
    print('\n')
    
print(games.groupby(['home_team_name','season']).min()['home_def_rtg'].sort_values().head(5))
home_team_name    season
Houston Rockets   2018      16.1
                  2019      15.4
Dallas Mavericks  2019      15.3
Houston Rockets   2017      15.3
                  2016      14.4
Name: home_3pm, dtype: float64


home_team_name         season
Golden State Warriors  2015      41.6
Phoenix Suns           2009      41.2
Golden State Warriors  2012      40.3
Sacramento Kings       2003      40.1
Phoenix Suns           2006      39.9
Name: home_3p_pct, dtype: float64


home_team_name         season
Dallas Mavericks       2019      115.8
Golden State Warriors  2018      115.0
Houston Rockets        2018      114.9
Golden State Warriors  2016      114.8
Houston Rockets        2016      114.1
Name: home_off_rtg, dtype: float64


home_team_name         season
Golden State Warriors  2017      73652.31927
Boston Celtics         2008      69147.21876
Golden State Warriors  2016      68847.57398
                       2015      67960.29471
San Antonio Spurs      2014      61239.20521
Name: home_raptor, dtype: float64


home_team_name         season
Golden State Warriors  2016      0.838
                       2015      0.830
Milwaukee Bucks        2019      0.815
Golden State Warriors  2014      0.804
San Antonio Spurs      2015      0.793
Name: home_win_pct, dtype: float64


home_team_name     season
San Antonio Spurs  2003      93.1
Detroit Pistons    2003      93.9
Indiana Pacers     2003      96.2
Brooklyn Nets      2003      96.4
Chicago Bulls      2011      97.5
Name: home_def_rtg, dtype: float64
In [11]:
import plotly.express as px
px.histogram(games,x='home_3pm')
In [12]:
px.histogram(games,x='home_raptor')
In [13]:
px.histogram(games,x='home_net_rtg')
In [14]:
from sklearn.metrics import classification_report, confusion_matrix

def run_tests(y,pred):
    print(classification_report(y,pred))
    print(confusion_matrix(y,pred))

Let's get a baseline model using only the winning percentages of each team, and try to beat its score.

In [15]:
y = games[target]

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
lr = LogisticRegression()
pred_base = cross_val_predict(lr,games[['home_win_pct','away_win_pct']],y.values.ravel(),cv=5)

run_tests(games[target],pred_base)
              precision    recall  f1-score   support

           0       0.67      0.67      0.67     22996
           1       0.67      0.67      0.67     22996

    accuracy                           0.67     45992
   macro avg       0.67      0.67      0.67     45992
weighted avg       0.67      0.67      0.67     45992

[[15370  7626]
 [ 7591 15405]]
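For reference, a baseline this simple doesn't even need a fitted model: predict a home win whenever the home team has the higher season winning percentage (giving home court the tie). A sketch of that rule on a made-up frame reusing the column names above:

```python
import pandas as pd

# illustrative rows, not real games
toy = pd.DataFrame({
    'home_win_pct':   [0.650, 0.400, 0.500, 0.710],
    'away_win_pct':   [0.500, 0.550, 0.500, 0.300],
    'home_team_wins': [1, 0, 1, 1],
})

# rule: predict a home win whenever home win pct >= away win pct
toy['pred'] = (toy['home_win_pct'] >= toy['away_win_pct']).astype(int)
accuracy = (toy['pred'] == toy['home_team_wins']).mean()
print(accuracy)
```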

First let's run a Logistic Regression, using all features.

In [16]:
pred_all = cross_val_predict(lr,games[features],games[target].values.ravel(),cv=5)

run_tests(y,pred_all)
              precision    recall  f1-score   support

           0       0.66      0.66      0.66     22996
           1       0.66      0.66      0.66     22996

    accuracy                           0.66     45992
   macro avg       0.66      0.66      0.66     45992
weighted avg       0.66      0.66      0.66     45992

[[15247  7749]
 [ 7739 15257]]

Slightly worse, actually. Let's try standardizing the predictive features (z-scores) before fitting the model.

In [17]:
# normalize the features before predicting
from sklearn.preprocessing import StandardScaler
features_scaled = StandardScaler().fit_transform(games[features])
lr2 = LogisticRegression(max_iter=10000)
pred_scaled = cross_val_predict(lr2,features_scaled,y.values.ravel(),cv=5)

run_tests(y,pred_scaled)
              precision    recall  f1-score   support

           0       0.67      0.67      0.67     22996
           1       0.67      0.67      0.67     22996

    accuracy                           0.67     45992
   macro avg       0.67      0.67      0.67     45992
weighted avg       0.67      0.67      0.67     45992

[[15477  7519]
 [ 7524 15472]]

Hardly any better than the baseline model. Let's try ridge (L2) regularization, which shrinks the coefficients of the many correlated, competing predictive features.

In [18]:
# Ridge Regression

from sklearn.linear_model import RidgeClassifier
ridge = RidgeClassifier()
pred_ridge = cross_val_predict(ridge,games[features],y.values.ravel(),cv=5)
run_tests(y,pred_ridge)
              precision    recall  f1-score   support

           0       0.67      0.67      0.67     22996
           1       0.67      0.67      0.67     22996

    accuracy                           0.67     45992
   macro avg       0.67      0.67      0.67     45992
weighted avg       0.67      0.67      0.67     45992

[[15460  7536]
 [ 7535 15461]]

Now let's try other classification models, Random Forest and K Nearest Neighbors, and see what happens.

In [19]:
# Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features_scaled,y.values.ravel(),test_size=0.3,random_state=22)
rf = RandomForestClassifier()

param_grid = {'n_estimators': [10,25,50,100,500,1000]}

grid = GridSearchCV(rf,param_grid,verbose=1)
In [49]:
grid.fit(X_train,y_train)
grid.best_params_
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed: 13.1min finished
Out[49]:
{'n_estimators': 500}
In [52]:
pred_rf = grid.predict(X_test)

run_tests(y_test,pred_rf)
              precision    recall  f1-score   support

           0       0.59      0.58      0.58      6919
           1       0.58      0.60      0.59      6879

    accuracy                           0.59     13798
   macro avg       0.59      0.59      0.59     13798
weighted avg       0.59      0.59      0.59     13798

[[3987 2932]
 [2760 4119]]
In [53]:
from sklearn.neighbors import KNeighborsClassifier
In [55]:
knn = KNeighborsClassifier(n_neighbors=2)
pred_knn = cross_val_predict(knn,features_scaled,y.values.ravel(),cv=5)
run_tests(y,pred_knn)
              precision    recall  f1-score   support

           0       0.55      0.79      0.65     22996
           1       0.62      0.35      0.44     22996

    accuracy                           0.57     45992
   macro avg       0.58      0.57      0.55     45992
weighted avg       0.58      0.57      0.55     45992

[[18173  4823]
 [15055  7941]]
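One caveat on the KNN result: n_neighbors=2 is an unusual setting, since an even k invites ties and a tiny k overfits noise. Sweeping larger odd values of k with cross-validation is the standard check; a sketch on synthetic data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)  # noisy labels

# larger odd k generally smooths out label noise relative to k=2
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 5, 15, 45)}
print(scores)
```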

Those were much worse. The moral of the story seems to be that, at least with the data I have available, you won't pick winners much more accurately than by simply choosing the team with the higher winning percentage. There is too much variability in game outcomes.

That being said, let's see if we can use Linear Regression to predict the net score (to be compared to the Vegas odds) and the total score of the game (to be compared to the over/under).

Linear Regression to predict the scores

In [14]:
total_pts = games['total_pts']
net_pts = games['home_net_pts']
In [15]:
px.histogram(games,x='total_pts')
In [16]:
px.histogram(games,x='home_net_pts')
In [21]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
In [22]:
import sklearn.metrics as metrics

def lin_test(y,pred):
    # return the root mean squared error so the caller can print it
    return np.sqrt(metrics.mean_squared_error(y,pred))
In [26]:
pred_lin_total = cross_val_predict(LinearRegression(),features_scaled,total_pts,cv=5)
pred_lin_net = cross_val_predict(LinearRegression(),features_scaled,net_pts,cv=5)
print('RMSE Total Points: ',lin_test(total_pts,pred_lin_total))
print('RMSE Net Points: ',lin_test(net_pts,pred_lin_net))
RMSE Total Points:  18.107159596368636
RMSE Net Points:  12.139998479354382

How does that compare to the standard deviations?

In [35]:
print('Total points STD: ',np.std(games['total_pts']))
print('Net Points STD: ',np.std(games['home_net_pts']))
Total points STD:  22.175621964386796
Net Points STD:  13.620769149911151
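The standard deviation is the right yardstick here because a model that always predicts the mean achieves an RMSE exactly equal to the (population) standard deviation, so beating the std means beating a constant predictor. A quick check:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=200, scale=22, size=10000)  # stand-in for total_pts

# constant-mean predictor: its RMSE equals the population standard deviation
rmse_mean = np.sqrt(np.mean((scores - scores.mean()) ** 2))
print(np.isclose(rmse_mean, np.std(scores)))  # True
```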

Let's try various polynomial features, and a ridge regression.

In [36]:
games_poly2 = PolynomialFeatures(degree=2,interaction_only=True).fit_transform(features_scaled)
games_poly3 = PolynomialFeatures(degree=3,interaction_only=True).fit_transform(features_scaled)
from sklearn.linear_model import Ridge

pred_total_poly2 = cross_val_predict(Ridge(),games_poly2,total_pts,cv=5)
pred_total_poly3 = cross_val_predict(Ridge(),games_poly3,total_pts,cv=5)

pred_net_poly2 = cross_val_predict(Ridge(),games_poly2,net_pts,cv=5)
pred_net_poly3 = cross_val_predict(Ridge(),games_poly3,net_pts,cv=5)

print('RMSE Total Points: ',lin_test(total_pts,pred_total_poly2))
print('RMSE Net Points: ',lin_test(net_pts,pred_net_poly2))

print('RMSE Total Points: ',lin_test(total_pts,pred_total_poly3))
print('RMSE Net Points: ',lin_test(net_pts,pred_net_poly3))
RMSE Total Points:  18.5833336473993
RMSE Net Points:  12.285793607905077
RMSE Total Points:  34.96962737420607
RMSE Net Points:  21.307199131677475
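One likely reason the degree-3 fit collapses: with 42 base features, interaction-only expansion to degree 3 produces over twelve thousand columns, leaving ridge to shrink an enormous amount of noise. A quick count:

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((1, 42))  # dummy row with the same 42 features
n3 = PolynomialFeatures(degree=3, interaction_only=True).fit_transform(X).shape[1]

# bias + singles + pairs + triples of distinct features
print(n3, 1 + 42 + comb(42, 2) + comb(42, 3))  # both 12384
```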

So the simplest model turned out to be the best, although it is not very predictive: an RMSE of about 18 points on total score against a standard deviation of about 22, and about 12 on net score against a standard deviation of about 14.

At the end of the day, I would not base gambling decisions on this model, unless the spread or over/under is significantly different from the model prediction.

In [ ]: